This notebook demonstrates a way to convert temporal graph data of flights into a time series that can be analysed using signature methods.
We aim to achieve two goals:
To achieve these goals, we split the time series of airplane flights into small chunks of approximately one week and identify each one as either anomalous or not. To do that, we convert the information stored in the flight data into a fixed-dimensional path and use the path signature as its vector representation.
%load_ext autoreload
%autoreload 2
import pandas as pd
import esig
import numpy as np
import random
import matplotlib.pyplot as plt
import itertools
import zipfile
import pickle
import iisignature
import sqlite3
import ruptures as rpt
from datetime import datetime
import sys
sys.path.append("..")
from signature_calculation import calculate_signatures
from path_calculation import construct_paths
from path_calculation import construct_single_path_edges
from path_calculation import construct_single_path_vertices
import outlier_calc
import clustering
from variance import variance
day = 86400  # number of seconds in one day
LOCKDOWN_SIGNATURES_PATH = 'signatures/key.pkl'
CONFLICT_SIGNATURES_PATH = 'signatures1/key.pkl'
DUMMY_SIGNATURES_PATH = 'dummy_signatures/key.pkl'
DUMMY_SIGNATURES_PATH2 = 'dummy_signatures/2_key.pkl'
DUMMY_CLEAN_SIGNATURES_PATH = 'dummy_signatures/key_clean.pkl'
This class is used to pass the model hyperparameters to all of the functions. The meaning of each parameter is noted alongside it below.
class hyperparameters:
    # Raw data preprocessing
    normalising_data = False             # normalise the raw path data
    scale_data = False                   # scale the raw path data
    preprocess_data_columnwise = False   # apply preprocessing to each dimension separately
    # Signature postprocessing
    normalising_signature = False        # normalise the computed signatures
    scale_signature = False              # scale the computed signatures
    covariance_method = "Emperical"      # covariance estimator name (spelling kept: matched downstream)
    knn_metric = 'minkowski'             # distance metric for the kNN outlier score
    # Path construction
    observed = ("EGLL", "EGCC", "EGPF", "EGPH", "EGKK")  # ICAO codes of the observed airports
    depth = 2                            # signature truncation depth
    time_variable = True                 # include absolute time as a path dimension
    daytime_variable = True              # include time of day as a path dimension
    stream_length = 7 * day              # length of each substream, in seconds
    stream_spacing = 2 * day             # spacing between substream start times, in seconds
    mode = "vertices"                    # "vertices" or "edges" path construction
    extrapollation = False               # (spelling kept: attribute name used throughout)
    lead_lag = False                     # apply the lead-lag transform
    directional_edges = True             # distinguish flight direction in edge mode
    neighbours = 5                       # number of neighbours k for the kNN outlier score
We use two types of data: real flight data stored in an SQLite database (flights.db) and synthetically generated dummy data with labelled anomalies.
The data is presented as follows:
query = ''' SELECT *
FROM Flights
LIMIT 5'''
conn = sqlite3.connect("file:flights.db?nolock=1", uri=True)
events = pd.read_sql_query(query, conn)
conn.close()
print(events)
  address  altitude callsign        date destination_iata destination_icao  \
0  000000         0  RA78039  2019-06-10             None             None
1  000000         0  RA78039  2019-06-25             None             None
2  5153AD         0     HKYO  2019-06-08             None             None
3  000000     39875     GLEX  2019-06-01             None             ZBAA
4  000000         0     GLEX  2019-06-01             None             ZBAA

  equipment           event   flight  flight_id  latitude  longitude operator  \
0      None  gate_departure  RA78039  547322955   59.8048    30.2531     None
1      None  gate_departure  RA78039  547322955   59.8048    30.2531     None
2      GRND  gate_departure     None  547323363   38.5831    68.8157     None
3      GLEX         descent     None  548767257   61.8015    29.3968     None
4      GLEX          landed     None  548767257   60.3155    24.9797     None

  origin_iata origin_icao registration      time
0         LED        ULLI         None  06:11:10
1         LED        ULLI         None  05:04:45
2        None        None         None  03:04:19
3        None        VHHH         None  00:18:28
4        None        VHHH         None  00:49:21
The path construction functions (from the path_calculation module) work with a series of events of the form ["Airport ID", "timestamp", "takeoff/landed", "flight id"], so we use an SQL query to get the data we need and convert it to the required format using pandas.
query= '''SELECT origin_icao, date, time, event, flight_id
FROM Flights
WHERE destination_icao IN {} AND origin_icao IN {} AND event = 'takeoff'
UNION
SELECT destination_icao, date, time, event, flight_id
FROM Flights
WHERE destination_icao IN {} AND origin_icao IN {} AND event = 'landed' '''.format(*([hyperparameters.observed] * 4) )
conn = sqlite3.connect("file:flights.db?nolock=1", uri=True)
events = pd.read_sql_query(query, conn)
conn.close()
def datetime_convert(row):
    # Combine the 'date' and 'time' columns into a Unix timestamp in seconds.
    date = row['date']
    time = row['time']
    date = date[2:]  # drop the century prefix so the date matches '%y-%m-%d'
    date_time = date + " " + time
    date_time_obj = datetime.strptime(date_time, '%y-%m-%d %H:%M:%S')
    return int(round(date_time_obj.timestamp()))
events['timestamp'] = events.apply(datetime_convert, axis=1)
events = events[['origin_icao', 'timestamp', 'event', 'flight_id']]
events = events.rename(columns={'origin_icao': 'icao', 'flight_id': 'id'})
events = events.sort_values("timestamp")
The dummy data contains three types of anomalies.
from generator import generate_data
generate_data()
dummy_clean_events = pd.read_csv('clean_data.csv')
dummy_anomalous_events_labels = pd.read_csv('data_anomaly_labels.csv')
dummy_anomalous_IDs = set(dummy_anomalous_events_labels['id'].values.tolist())
dummy_events = pd.read_csv('data.csv')
We focus on converting temporal graph data into a fixed-dimensional time series while retaining valuable information about the order of events. For the purposes of this notebook we split the stream into fixed-length substreams and identify those as anomalous; however, it is theoretically possible to use anomaly detection on the stream itself. The primary focus of this methodology is to detect unusual and potentially malicious activity in observed regions, such as military troop transfers. The methodology could also be used for classification, either of nodes (for example, detecting hub airports) or of time periods (detecting change related to COVID restrictions).
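As an illustrative aside, here is a minimal sketch of the windowing step; the actual splitting is performed by construct_paths below, whose internals may differ. It assumes the events are held in a pandas DataFrame with a timestamp column in seconds.

def window_events_sketch(events, stream_length, stream_spacing):
    # Cut the event stream into overlapping windows of stream_length seconds,
    # with consecutive windows starting stream_spacing seconds apart.
    start, end = events['timestamp'].min(), events['timestamp'].max()
    windows = []
    t = start
    while t + stream_length <= end:
        mask = (events['timestamp'] >= t) & (events['timestamp'] < t + stream_length)
        windows.append(events[mask])
        t += stream_spacing
    return windows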
We will consider the original graph to have timed edges, which can also be thought of as messages passed between pairs of nodes. In the case of flight data, an edge between nodes A and B lasting from time x to time y represents an airplane which departs from A at time x and arrives at B at time y. There are two natural ways to represent this data as a time series: vertex-focused and edge-focused.
In the vertex case, we consider arrivals and departures at a given node as two separate dimensions and keep cumulative counts of each. This approach uses 2n dimensions, where n is the number of vertices, but loses the information connecting an arrival and a departure as the two endpoints of the same event. It is potentially valuable when observing large subgraphs, and when interactions of flights at an airport matter more than global topological properties (e.g. when trying to detect chaining of flights).
In the edge case, we consider each edge (pair of airports) as a dimension and record the number of flights between the airports at any given time. This can be encoded either in a single dimension, with one direction negative and the other positive, or in two dimensions if flights often run in both directions at the same time. The downside of this method is that it uses O(n^2) dimensions; it does, however, retain more information. There is also less connection between flights arriving at or departing from the same airport, as each airport's data is spread across multiple dimensions.
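To make the vertex representation concrete, here is a minimal sketch of building such a cumulative-count path. It is an illustration only, not the implementation used by construct_single_path_vertices.

import numpy as np

def vertex_path_sketch(events, observed):
    # One departure dimension and one arrival dimension per observed airport,
    # plus time as the final coordinate of each path point.
    index = {icao: i for i, icao in enumerate(observed)}
    counts = np.zeros(2 * len(observed))
    path = []
    for _, row in events.iterrows():
        if row['icao'] not in index:
            continue
        offset = 0 if row['event'] == 'takeoff' else len(observed)
        counts[index[row['icao']] + offset] += 1
        path.append(np.append(counts, row['timestamp']))
    return np.array(path)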
After obtaining the substreams, we use path signatures (see https://github.com/pafoster/path_signatures_introduction) as a vectorisation of the data. Path signatures are well suited to this sort of data, as potentially anomalous activity is very sensitive to the order of events.
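As a small self-contained example of the vectorisation step (the calculate_signatures helper used below presumably wraps something similar):

import numpy as np
import iisignature

toy_path = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 2.0]])  # three points in 2 dimensions
signature = iisignature.sig(toy_path, 2)  # signature truncated at depth 2
print(len(signature))  # 2 + 2**2 = 6 terms (the constant leading 1 is omitted)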
One dimension of the edge-focused path, plotted against time:
path = construct_single_path_edges(events, hyperparameters)
plt.plot([path[i][-1] for i in range(500)], [path[i][2] for i in range(500)])
plt.show()
One dimension of the vertex-focused path, plotted against time:
path = construct_single_path_vertices(events, hyperparameters)
plt.plot([path[i][-1] for i in range(500)], [path[i][5] for i in range(500)])
plt.show()
We save the signatures of the subpaths for each relevant combination of hyperparameters as pickle files for later use.
def calculate_signatures_by_hyperparameters(events, SIGNATURES_PATH, anomalous_IDs = []):
    """Sweep the hyperparameter grid, construct substreams and their path
    signatures, and pickle the streams, timestamps, labels and signatures
    under keys derived from the hyperparameter values."""
    print("calculating signatures")
for stream_spacing in [3]:
for stream_length in [10]:
for scaling in [True, False]:
hyperparameters.scale_data = scaling
hyperparameters.stream_spacing = stream_spacing * day
hyperparameters.stream_length = stream_length * day
for mode in ["edges", "vertices"]:
for param_combination in itertools.product(*([[False, True]] * 3)):
hyperparameters.mode = mode
hyperparameters.daytime_variable = param_combination[0]
hyperparameters.time_variable = param_combination[1]
hyperparameters.extrapollation = param_combination[2]
key = 'LeadLag=False, Daytime={}, Time={}, Extrapollation={}'.format(*param_combination)+ ', Data Scaling={}'.format(scaling) + ', Length=' + str(stream_length) + ', Spacing = ' + str(stream_spacing) + ', Mode = ' + mode
print('.', end='')
if mode == "vertices":
streams, anomaly_count, timestamps = construct_paths(events, anomalous_IDs, hyperparameters)
streams = streams[1:-6]
anomaly_count = anomaly_count[1:-6]
timestamps = timestamps[1:-6]
labels = [anomaly_count[i] > 0 for i in range(len(anomaly_count))]
with open(SIGNATURES_PATH.replace('key', key + " TIMESTAMPS"), 'wb') as f:
pickle.dump(timestamps, f)
with open(SIGNATURES_PATH.replace('key', key + " STREAMS"), 'wb') as f:
pickle.dump(streams, f)
if anomalous_IDs:
with open(SIGNATURES_PATH.replace('key', key + ' LABELS'), 'wb') as f:
pickle.dump(labels, f)
for depth in [2, 3]:
key_ = key + ', Depth = ' + str(depth)
hyperparameters.depth = depth
streams_signatures = calculate_signatures(streams, hyperparameters)
with open(SIGNATURES_PATH.replace('key', key_), 'wb') as f:
pickle.dump(streams_signatures, f)
else:
for directional_edges in [True, False]:
hyperparameters.directional_edges = directional_edges
key_ = key + ', Directional_edges = {}'.format(directional_edges)
streams, anomaly_count, timestamps = construct_paths(events, anomalous_IDs, hyperparameters)
streams = streams[1:-6]
anomaly_count = anomaly_count[1:-6]
timestamps = timestamps[1:-6]
labels = [anomaly_count[i] > 0 for i in range(len(anomaly_count))]
with open(SIGNATURES_PATH.replace('key', key_ + " TIMESTAMPS"), 'wb') as f:
pickle.dump(timestamps, f)
with open(SIGNATURES_PATH.replace('key', key_ + " STREAMS"), 'wb') as f:
pickle.dump(streams, f)
if anomalous_IDs:
with open(SIGNATURES_PATH.replace('key', key_ + ' LABELS'), 'wb') as f:
pickle.dump(labels, f)
for depth in [2, 3]:
key__ = key_ + ', Depth = ' + str(depth)
hyperparameters.depth = depth
streams_signatures = calculate_signatures(streams, hyperparameters)
with open(SIGNATURES_PATH.replace('key', key__), 'wb') as f:
pickle.dump(streams_signatures, f)
We first calculate signatures for the five biggest UK airports and then cluster the resulting signatures using the k-means algorithm.
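A minimal sketch of what clustering.kmeans is assumed to do (its actual implementation may differ): run 2-cluster k-means on the signature vectors and return one label per substream.

from sklearn.cluster import KMeans
import numpy as np

def kmeans_sketch(signatures, n_clusters=2):
    # Each signature is one feature vector; the labels partition the substreams.
    X = np.asarray(signatures)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)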
hyperparameters.observed = tuple(["EGLL", "EGCC", "EGPF", "EGPH", "EGKK"])
calculate_signatures_by_hyperparameters(events, LOCKDOWN_SIGNATURES_PATH)
calculating signatures ................................
lockdown_labels_vs_hyperparameters = {}
lockdown_timestamps_vs_hyperparameters = {}
for stream_spacing in [3]:
for stream_length in [10, 20]:
hyperkey = ', Length=' + str(stream_length) + ', Spacing = '+ str(stream_spacing)
labels = {}
timestamps = {}
for scaling in [True, False]:
for mode in ["edges", "vertices"]:
for param_combination in itertools.product(*([[False, True]] * 3)):
if mode == "vertices":
timestamps_key = 'LeadLag=False, Daytime={}, Time={}, Extrapollation={}'.format(*param_combination)+', Data Scaling={}'.format(scaling) + hyperkey + ', Mode = ' + mode
for depth in [2, 3]:
file_key = timestamps_key + ', Depth = ' + str(depth)
key = 'LeadLag=False, Daytime={}, Time={}, Extrapollation={}'.format(*param_combination)+', Data Scaling={}'.format(scaling)+ ', Mode = ' + mode + ', Depth = ' + str(depth)
with open(LOCKDOWN_SIGNATURES_PATH.replace('key', timestamps_key + " TIMESTAMPS"), 'rb') as f_timestamps:
with open(LOCKDOWN_SIGNATURES_PATH.replace('key', file_key), 'rb') as f_signatures:
signatures = pickle.load(f_signatures)
timestamps[key] = pickle.load(f_timestamps)
labels[key] = clustering.kmeans(signatures)
else:
for directional_edges in [True, False]:
timestamps_key = 'LeadLag=False, Daytime={}, Time={}, Extrapollation={}'.format(*param_combination)+', Data Scaling={}'.format(scaling)+ hyperkey + ', Mode = ' + mode + ', Directional_edges = {}'.format(directional_edges)
for depth in [2, 3]:
file_key = timestamps_key + ', Depth = ' + str(depth)
key = 'LeadLag=False, Daytime={}, Time={}, Extrapollation={}'.format(*param_combination)+', Data Scaling={}'.format(scaling)+ ', Mode = ' + mode + ', Directional_edges = {}'.format(directional_edges) + ', Depth = ' + str(depth)
with open(LOCKDOWN_SIGNATURES_PATH.replace('key', timestamps_key + " TIMESTAMPS"), 'rb') as f_timestamps:
with open(LOCKDOWN_SIGNATURES_PATH.replace('key', file_key), 'rb') as f_signatures:
signatures = pickle.load(f_signatures)
timestamps[key] = pickle.load(f_timestamps)
labels[key] = clustering.kmeans(signatures)
lockdown_labels_vs_hyperparameters[hyperkey] = labels
lockdown_timestamps_vs_hyperparameters[hyperkey] = timestamps
fontsize=13
n_subplots_per_row = 5
for key in lockdown_labels_vs_hyperparameters.keys():
print(key)
labels = lockdown_labels_vs_hyperparameters[key]
timestamps = lockdown_timestamps_vs_hyperparameters[key]
n_rows = int(len(labels.keys()) / n_subplots_per_row)
    fig, axes = plt.subplots(n_rows, n_subplots_per_row, figsize=(13.5, n_rows * 2.0), squeeze=True)
    for key_, ax in zip(labels.keys(), axes.flatten()):
ax.scatter(timestamps[key_], labels[key_])
ax.set_title(key_.replace(",", ",\n"), fontsize=fontsize)
ax.set_xlabel('Time', fontsize=fontsize)
ax.set_ylabel('Classification', fontsize=fontsize)
#ax.tick_params(axis='both', which='major', labelsize=fontsize)
#ax.tick_params(axis='both', which='minor', labelsize=fontsize)
#plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')
plt.subplots_adjust(hspace = 1.5, top = 2, right = 1.5)
plt.show()
, Length=10, Spacing = 3
, Length=20, Spacing = 3
As we can see, some combinations of hyperparameters fail to produce meaningful results. However, two change points are consistently detected under some hyperparameter choices; as expected, these correspond to dates when UK lockdown policy changed significantly.
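The change point timestamps printed below are hard-coded. Since ruptures is imported above (as rpt), one could also locate change points programmatically from a cluster-label series; a minimal sketch, not the procedure used here:

import numpy as np
import ruptures as rpt

def change_points_sketch(timestamps, labels, n_bkps=2):
    # Binary segmentation on the 0/1 cluster labels; the returned indices are
    # the starts of new segments, mapped back to their timestamps.
    signal = np.asarray(labels, dtype=float).reshape(-1, 1)
    breakpoints = rpt.Binseg(model="l2").fit(signal).predict(n_bkps=n_bkps)
    return [timestamps[i] for i in breakpoints[:-1]]  # last entry is len(signal)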
# Change points detected
print(datetime.fromtimestamp(1585636381))
print(datetime.fromtimestamp(1573453328))
2020-03-31 06:33:01
2019-11-11 06:22:08
The methodology here is identical to the one used to detect lockdowns in the UK.
Airports investigated for malicious activity:
LTFM, OSDI, HLZN, DRRN, OJAI, LTBA, HLLS, EBOS, LMML, DTTX, HLKF, HLMS, HLLB, HEBA, DTTA
We can only investigate a few of these airports at any given time due to dimensionality issues: with d path dimensions, a depth-k signature has d + d^2 + ... + d^k terms, so the feature vector grows quickly with the number of observed airports, especially in edge mode.
hyperparameters.observed = tuple(["LTFM", "OSDI", "HLZN", "DRRN", "EBOS", "LMML", "DTTX"])
query2= '''SELECT origin_icao, date, time, event, flight_id
FROM Flights
WHERE destination_icao IN {} AND origin_icao IN {} AND event = 'takeoff' AND date < '2020-04-01'
UNION
SELECT destination_icao, date, time, event, flight_id
FROM Flights
WHERE destination_icao IN {} AND origin_icao IN {} AND event = 'landed' AND date < '2020-04-01' '''.format(*([hyperparameters.observed] * 4) )
conn = sqlite3.connect("file:flights.db?nolock=1", uri=True)
events2 = pd.read_sql_query(query2, conn)
conn.close()
def datetime_convert(row):
    # Combine the 'date' and 'time' columns into a Unix timestamp in seconds.
    date = row['date']
    time = row['time']
    date = date[2:]  # drop the century prefix so the date matches '%y-%m-%d'
    date_time = date + " " + time
    date_time_obj = datetime.strptime(date_time, '%y-%m-%d %H:%M:%S')
    return int(round(date_time_obj.timestamp()))
events2['timestamp'] = events2.apply(datetime_convert, axis=1)
events2 = events2[['origin_icao', 'timestamp', 'event', 'flight_id']]
events2 = events2.rename(columns={'origin_icao': 'icao', 'flight_id': 'id'})
events2 = events2.sort_values("timestamp")
hyperparameters.observed = tuple(["LTFM", "OSDI", "HLZN", "DRRN", "EBOS", "LMML", "DTTX"])
calculate_signatures_by_hyperparameters(events2, CONFLICT_SIGNATURES_PATH)
calculating signatures ................................
conflict_labels_vs_hyperparameters = {}
conflict_timestamps_vs_hyperparameters = {}
for stream_spacing in [3]:
for stream_length in [10, 20]:
hyperkey = ', Length=' + str(stream_length) + ', Spacing = '+ str(stream_spacing)
labels = {}
timestamps = {}
for scaling in [True, False]:
for mode in ["edges", "vertices"]:
for param_combination in itertools.product(*([[False, True]] * 3)):
if mode == "vertices":
timestamps_key = 'LeadLag=False, Daytime={}, Time={}, Extrapollation={}'.format(*param_combination)+', Data Scaling={}'.format(scaling) + hyperkey + ', Mode = ' + mode
for depth in [2, 3]:
file_key = timestamps_key + ', Depth = ' + str(depth)
key = 'LeadLag=False, Daytime={}, Time={}, Extrapollation={}'.format(*param_combination)+', Data Scaling={}'.format(scaling)+ ', Mode = ' + mode + ', Depth = ' + str(depth)
with open(CONFLICT_SIGNATURES_PATH.replace('key', timestamps_key + " TIMESTAMPS"), 'rb') as f_timestamps:
with open(CONFLICT_SIGNATURES_PATH.replace('key', file_key), 'rb') as f_signatures:
signatures = pickle.load(f_signatures)
timestamps[key] = pickle.load(f_timestamps)
labels[key] = clustering.kmeans(signatures)
else:
for directional_edges in [True, False]:
timestamps_key = 'LeadLag=False, Daytime={}, Time={}, Extrapollation={}'.format(*param_combination)+', Data Scaling={}'.format(scaling)+ hyperkey + ', Mode = ' + mode + ', Directional_edges = {}'.format(directional_edges)
for depth in [2, 3]:
file_key = timestamps_key + ', Depth = ' + str(depth)
key = 'LeadLag=False, Daytime={}, Time={}, Extrapollation={}'.format(*param_combination)+', Data Scaling={}'.format(scaling)+ ', Mode = ' + mode + ', Directional_edges = {}'.format(directional_edges) + ', Depth = ' + str(depth)
with open(CONFLICT_SIGNATURES_PATH.replace('key', timestamps_key + " TIMESTAMPS"), 'rb') as f_timestamps:
with open(CONFLICT_SIGNATURES_PATH.replace('key', file_key), 'rb') as f_signatures:
signatures = pickle.load(f_signatures)
timestamps[key] = pickle.load(f_timestamps)
labels[key] = clustering.kmeans(signatures)
conflict_labels_vs_hyperparameters[hyperkey] = labels
conflict_timestamps_vs_hyperparameters[hyperkey] = timestamps
fontsize=13
n_subplots_per_row = 5
for key in conflict_labels_vs_hyperparameters.keys():
print(key)
labels = conflict_labels_vs_hyperparameters[key]
timestamps = conflict_timestamps_vs_hyperparameters[key]
n_rows = int(len(labels.keys()) / n_subplots_per_row)
    fig, axes = plt.subplots(n_rows, n_subplots_per_row, figsize=(13.5, n_rows * 2.0), squeeze=True)
    for key_, ax in zip(labels.keys(), axes.flatten()):
ax.scatter(timestamps[key_], labels[key_])
ax.set_title(key_.replace(",", ",\n"), fontsize=fontsize)
ax.set_xlabel('Time', fontsize=fontsize)
ax.set_ylabel('Classification', fontsize=fontsize)
#ax.tick_params(axis='both', which='major', labelsize=fontsize)
#ax.tick_params(axis='both', which='minor', labelsize=fontsize)
#plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')
plt.subplots_adjust(hspace = 1.5, top = 2, right = 1.5)
plt.show()
, Length=10, Spacing = 3
, Length=20, Spacing = 3
Again, most hyperparameter combinations fail to produce significant results, but those that do produce a clear change point detect a date consistent with when Russian involvement is believed to have started.
# Detected change points
print(datetime.fromtimestamp(1577604863))
print(datetime.fromtimestamp(1573186141))
2019-12-29 07:34:23
2019-11-08 04:09:01
As seen above, the results produced by analysing real-life data are messy, depend heavily on the choice of hyperparameters, and can only be evaluated subjectively by how "clean" the change point is and how well it corresponds to when the anomalous time period is believed to have started. Hence, we use dummy data to analyse the performance of our vectorisation with off-the-shelf anomaly detection algorithms on labelled data with known anomalies.
To do so, we compute subpaths and signatures across different combinations of hyperparameters for our data and plot a ROC curve for each of them, using the area under the ROC curve as the performance metric. We also test the algorithm both on data with three kinds of anomaly and on data with a single kind, under the conjecture that fewer types of anomaly are easier to detect.
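A minimal sketch of the metric, assuming outlier_calc.plot_roc_curve wraps something similar to scikit-learn's ROC utilities:

from sklearn.metrics import roc_curve, roc_auc_score

def roc_auc_sketch(scores, labels):
    # Higher scores should indicate anomalies; labels are the ground truth.
    fpr, tpr, _ = roc_curve(labels, scores)
    return fpr, tpr, roc_auc_score(labels, scores)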
hyperparameters.observed = tuple([np.int_(0), np.int_(1), np.int_(2), np.int_(3), np.int_(4)])
calculate_signatures_by_hyperparameters(dummy_events, DUMMY_SIGNATURES_PATH, dummy_anomalous_IDs)
print()
calculate_signatures_by_hyperparameters(dummy_clean_events, DUMMY_CLEAN_SIGNATURES_PATH)
calculating signatures ................................
calculating signatures ................................
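Below we score each substream with outlier_calc.knn. A hedged sketch of one common kNN outlier score (the actual helper may differ): the mean distance from a signature to its k nearest neighbours, using the metric and neighbour count from the hyperparameters.

from sklearn.neighbors import NearestNeighbors
import numpy as np

def knn_outlier_score_sketch(signatures, n_neighbors=5, metric='minkowski'):
    X = np.asarray(signatures)
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1, metric=metric).fit(X)
    distances, _ = nn.kneighbors(X)       # column 0 is each point's distance to itself
    return distances[:, 1:].mean(axis=1)  # higher score = more isolated substream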
knn_prediction_vs_hyperparameters = {}
labels_vs_hyperparameters = {}
for stream_spacing in [3]:
for stream_length in [10]:
knn_prediction ={}
knn_labels = {}
for scaling in [True, False]:
for mode in ["edges", "vertices"]:
for param_combination in itertools.product(*([[False, True]] * 3)):
key = 'LeadLag=False, Daytime={}, Time={}, Extrapollation={}'.format(*param_combination)+', Data Scaling={}'.format(scaling) + ', Length=' + str(stream_length) + ', Spacing = ' + str(stream_spacing) + ', Mode = ' + mode
print('.', end='')
for depth in [2, 3]:
if mode == "edges":
for directional_edges in [True, False]:
hyperparameters.directional_edges = directional_edges
key_ = key + ', Directional_edges = {}'.format(directional_edges)
key__ = key_ + ', Depth = ' + str(depth)
with open(DUMMY_SIGNATURES_PATH.replace('key', key__), 'rb') as f_sig:
with open(DUMMY_SIGNATURES_PATH.replace('key', key_ + ' LABELS'), 'rb') as f_label:
signatures = pickle.load(f_sig)
labels = pickle.load(f_label)
pred = outlier_calc.knn(signatures, hyperparameters)
knn_prediction[key_] = pred
knn_labels[key_] = labels
else:
key_ = key + ', Depth = ' + str(depth)
with open(DUMMY_SIGNATURES_PATH.replace('key', key_), 'rb') as f_sig:
with open(DUMMY_SIGNATURES_PATH.replace('key', key + ' LABELS'), 'rb') as f_label:
signatures = pickle.load(f_sig)
labels = pickle.load(f_label)
pred = outlier_calc.knn(signatures, hyperparameters)
knn_prediction[key_] = pred
knn_labels[key_] = labels
key2 = "Stream length = {} days".format(stream_length)
knn_prediction_vs_hyperparameters[key2] = knn_prediction
labels_vs_hyperparameters[key2] = knn_labels
................................
for key in knn_prediction_vs_hyperparameters.keys():
print(key)
outlier_calc.plot_roc_curve(knn_prediction_vs_hyperparameters[key], labels_vs_hyperparameters[key])
Stream length = 10 days
from generator import generate_data_only_chaining
generate_data_only_chaining()
dummy_anomalous_events_labels2 = pd.read_csv('data_chaining_only_anomaly_labels.csv')
dummy_anomalous_IDs2 = set(dummy_anomalous_events_labels2['id'].values.tolist())
dummy_events2 = pd.read_csv('data_chaining_only.csv')
hyperparameters.observed = tuple([np.int_(0), np.int_(1), np.int_(2), np.int_(3), np.int_(4)])
calculate_signatures_by_hyperparameters(dummy_events2, DUMMY_SIGNATURES_PATH2, dummy_anomalous_IDs2)
calculating signatures ................................
knn_prediction_vs_hyperparameters2 = {}
labels_vs_hyperparameters2 = {}
for stream_spacing in [3]:
for stream_length in [10]:
knn_prediction ={}
knn_labels = {}
for scaling in [True, False]:
for mode in ["edges", "vertices"]:
for param_combination in itertools.product(*([[False, True]] * 3)):
key = 'LeadLag=False, Daytime={}, Time={}, Extrapollation={}'.format(*param_combination)+', Data Scaling={}'.format(scaling) + ', Length=' + str(stream_length) + ', Spacing = ' + str(stream_spacing) + ', Mode = ' + mode
print('.', end='')
for depth in [2, 3]:
if mode == "edges":
for directional_edges in [True, False]:
hyperparameters.directional_edges = directional_edges
key_ = key + ', Directional_edges = {}'.format(directional_edges)
key__ = key_ + ', Depth = ' + str(depth)
with open(DUMMY_SIGNATURES_PATH2.replace('key', key__), 'rb') as f_sig:
with open(DUMMY_SIGNATURES_PATH2.replace('key', key_ + ' LABELS'), 'rb') as f_label:
signatures = pickle.load(f_sig)
labels = pickle.load(f_label)
pred = outlier_calc.knn(signatures, hyperparameters)
knn_prediction[key_] = pred
knn_labels[key_] = labels
else:
key_ = key + ', Depth = ' + str(depth)
with open(DUMMY_SIGNATURES_PATH2.replace('key', key_), 'rb') as f_sig:
with open(DUMMY_SIGNATURES_PATH2.replace('key', key + ' LABELS'), 'rb') as f_label:
signatures = pickle.load(f_sig)
labels = pickle.load(f_label)
pred = outlier_calc.knn(signatures, hyperparameters)
knn_prediction[key_] = pred
knn_labels[key_] = labels
key2 = "Stream length = {} days".format(stream_length)
knn_prediction_vs_hyperparameters2[key2] = knn_prediction
labels_vs_hyperparameters2[key2] = knn_labels
................................
for key in knn_prediction_vs_hyperparameters2.keys():
print(key)
outlier_calc.plot_roc_curve(knn_prediction_vs_hyperparameters2[key], labels_vs_hyperparameters2[key])
Stream length = 10 days
You can read more about the conformance score in the following paper: https://arxiv.org/abs/2006.03487v1
The code generating the scores is taken directly from that paper.
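In brief (see the paper for the precise definition), the conformance score measures how far the signature of a new substream lies from the corpus of signatures of known-normal ("clean") substreams, so a higher score suggests anomalous behaviour; this is why the clean dummy data generated above is needed here.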
conformance_prediction_vs_hyperparameters = {}
conformance_labels_vs_hyperparameters = {}
for stream_spacing in [3]:
for stream_length in [10]:
conformance_prediction ={}
conformance_labels = {}
for scaling in [False]:
for mode in ["edges", "vertices"]:
for param_combination in itertools.product(*([[False, True]] * 2)):
key = 'LeadLag=False, Daytime={}, Time={}, Extrapollation=False'.format(*param_combination)+', Data Scaling={}'.format(scaling) + ', Length=' + str(stream_length) + ', Spacing = ' + str(stream_spacing) + ', Mode = ' + mode
if mode == "edges":
for directional_edges in [True, False]:
hyperparameters.directional_edges = directional_edges
key_ = key + ', Directional_edges = {}'.format(directional_edges)
with open(DUMMY_SIGNATURES_PATH2.replace('key', key_ + " STREAMS"), 'rb') as f_streams:
with open(DUMMY_SIGNATURES_PATH2.replace('key', key_ + ' LABELS'), 'rb') as f_label:
with open(DUMMY_CLEAN_SIGNATURES_PATH.replace('key', key_ + " STREAMS"), 'rb') as f_clean_sig:
clean_streams = pickle.load(f_clean_sig)
streams = pickle.load(f_streams)
labels = pickle.load(f_label)
for depth in [2]:
key__ = key_ + ', Depth = ' + str(depth)
pred = variance(streams, clean_streams, depth)
conformance_prediction[key_] = pred
conformance_labels[key_] = labels
else:
key_ = key
with open(DUMMY_SIGNATURES_PATH2.replace('key', key_ + " STREAMS"), 'rb') as f_streams:
with open(DUMMY_SIGNATURES_PATH2.replace('key', key_ + ' LABELS'), 'rb') as f_label:
with open(DUMMY_CLEAN_SIGNATURES_PATH.replace('key', key_ + " STREAMS"), 'rb') as f_clean_streams:
clean_streams = pickle.load(f_clean_streams)
streams = pickle.load(f_streams)
labels = pickle.load(f_label)
for depth in [2]:
key__ = key_ + ', Depth = ' + str(depth)
pred = variance(streams, clean_streams, depth)
conformance_prediction[key_] = pred
conformance_labels[key_] = labels
key2 = "Stream length = {} days".format(stream_length)
conformance_prediction_vs_hyperparameters[key2] = conformance_prediction
conformance_labels_vs_hyperparameters[key2] = conformance_labels
Computing signatures: 100%|██████████| 2950/2950 [03:58<00:00, 12.34it/s]
Building shuffle matrix: 100%|██████████| 111/111 [00:00<00:00, 855.79it/s]
Computing variances: 100%|██████████| 869/869 [00:29<00:00, 29.82it/s]
(progress bars repeated for each of the remaining hyperparameter combinations)
for key in conformance_prediction_vs_hyperparameters.keys():
print(key)
outlier_calc.plot_roc_curve(conformance_prediction_vs_hyperparameters[key], conformance_labels_vs_hyperparameters[key])
Stream length = 10 days